vision-language tracking
VastTrack: Vast Category Visual Object Tracking
VastTrack possesses several attractive properties: (1) Vast Object Category. In particular, it covers targets from 2,115 categories, significantly surpassing the object classes of existing popular benchmarks (e.g., GOT-10k with 563 classes and LaSOT with 70 categories). By providing such vast object classes, we expect to facilitate learning more general object tracking.
- Information Technology > Artificial Intelligence > Vision (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.90)
- Information Technology > Artificial Intelligence > Robots (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Divert More Attention to Vision-Language Tracking
Relying on Transformers for complex visual feature learning, object tracking has witnessed a new standard for the state of the art (SOTA). However, this advancement is accompanied by larger training data and longer training periods, making tracking increasingly expensive. In this paper, we demonstrate that reliance on Transformers is not necessary and that pure ConvNets remain competitive, and even better, while being more economical and training-friendly in achieving SOTA tracking. Our solution is to unleash the power of multimodal vision-language (VL) tracking, simply using ConvNets. The essence lies in learning novel unified-adaptive VL representations with our modality mixer (ModaMixer) and asymmetrical ConvNet search. We show that our unified-adaptive VL representation, learned purely with ConvNets, is a simple yet strong alternative to Transformer visual features, remarkably improving a CNN-based Siamese tracker by 14.5% in SUC on the challenging LaSOT benchmark (50.7%$\rightarrow$65.2%),
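The abstract above centers on mixing a language description into ConvNet features. As a rough, hedged illustration of that general idea (not the paper's ModaMixer), the snippet below gates convolutional feature channels with a sentence embedding; the module name, dimensions, and residual design are assumptions made for the sketch.

```python
# Illustrative sketch only: one plausible channel-wise "modality mixer" that
# injects a sentence embedding into ConvNet feature maps. Names and shapes are
# assumptions, not the paper's ModaMixer implementation.
import torch
import torch.nn as nn

class ChannelLanguageMixer(nn.Module):
    def __init__(self, text_dim: int = 768, vis_channels: int = 256):
        super().__init__()
        # Project the sentence embedding to one gate value per visual channel.
        self.to_channel_weights = nn.Sequential(
            nn.Linear(text_dim, vis_channels),
            nn.Sigmoid(),  # gate in [0, 1]
        )

    def forward(self, vis_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) ConvNet features; text_emb: (B, text_dim).
        gate = self.to_channel_weights(text_emb)      # (B, C)
        mixed = vis_feat * gate[:, :, None, None]     # channel-wise reweighting
        return mixed + vis_feat                       # residual keeps the vision-only path

# Toy usage
mixer = ChannelLanguageMixer()
v = torch.randn(2, 256, 16, 16)
t = torch.randn(2, 768)
print(mixer(v, t).shape)  # torch.Size([2, 256, 16, 16])
```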
ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking
Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang
Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse a fixed language description with vision features or simply modulate it with attention; however, their performance remains limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking; however, these works fail to provide insight into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, built on the pre-trained vision-language model Qwen2.5-VL. Both supervised fine-tuning (SFT) and GRPO-based reinforcement learning are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features; a tracking head then predicts the location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language tracking task. Extensive experiments on multiple vision-language tracking benchmarks fully validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code will be released at https://github.com/Event-AHU/Open_VLTrack
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Generation (0.74)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.67)
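The ReasoningTrack abstract above describes embedding updated language descriptions, feeding them into a unified backbone together with vision features, and predicting the target location with a tracking head. The following is a minimal, assumption-laden sketch of that generic pipeline shape; it does not use Qwen2.5-VL, SFT, or GRPO, and every encoder, dimension, and name below is illustrative rather than the paper's implementation.

```python
# Illustrative sketch only: updated text tokens and frame tokens sharing one
# backbone before a box head. Encoders, dimensions, and names are assumptions.
import torch
import torch.nn as nn

class UnifiedVLTracker(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(768, dim)   # stand-in for image-encoder patch features
        self.txt_proj = nn.Linear(768, dim)   # stand-in for text-embedding projection
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.box_head = nn.Linear(dim, 4)     # (cx, cy, w, h), normalized

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, Nv, 768) patch features; txt_tokens: (B, Nt, 768) text features.
        tokens = torch.cat([self.vis_proj(vis_tokens), self.txt_proj(txt_tokens)], dim=1)
        fused = self.backbone(tokens)
        # Pool the visual part of the fused sequence and regress one box per frame.
        box = self.box_head(fused[:, : vis_tokens.size(1)].mean(dim=1))
        return box.sigmoid()

# Toy usage
tracker = UnifiedVLTracker()
box = tracker(torch.randn(1, 196, 768), torch.randn(1, 32, 768))
print(box)  # normalized (cx, cy, w, h)
```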
MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts
Vision-language tracking (VLT) enhances traditional visual object tracking by integrating language descriptions, requiring the tracker to flexibly understand complex and diverse text in addition to visual information. However, most existing vision-language trackers still rely heavily on initial fixed multimodal prompts, which struggle to provide effective guidance for dynamically changing targets. Fortunately, the Complementary Learning Systems (CLS) theory suggests that the human memory system can dynamically store and utilize multimodal perceptual information, thereby adapting to new scenarios. Inspired by this, we propose a Memory-based Vision-Language Tracker (MemVLT). By incorporating memory modeling to adjust static prompts, our approach provides adaptive prompts for tracking guidance.
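As a hedged illustration of the memory-adapted-prompt idea in the MemVLT abstract (not the paper's actual memory design), the toy module below keeps a small FIFO of recent target features and uses attention to update a static prompt; the capacity, dimensions, and residual update are assumptions.

```python
# Illustrative sketch only: a tiny memory bank that adapts a static prompt with
# stored per-frame features via attention. This is an assumption-level toy,
# not MemVLT's actual design.
import torch
import torch.nn as nn

class PromptMemory(nn.Module):
    def __init__(self, dim: int = 256, capacity: int = 16):
        super().__init__()
        self.capacity = capacity
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.memory: list[torch.Tensor] = []   # stores (1, dim) feature snapshots

    def write(self, frame_feat: torch.Tensor) -> None:
        # Keep a bounded FIFO of recent target features.
        self.memory.append(frame_feat.detach())
        if len(self.memory) > self.capacity:
            self.memory.pop(0)

    def adapt(self, static_prompt: torch.Tensor) -> torch.Tensor:
        # static_prompt: (1, N, dim). With no memory yet, fall back to the fixed prompt.
        if not self.memory:
            return static_prompt
        mem = torch.stack(self.memory, dim=1)            # (1, M, dim)
        adapted, _ = self.attn(static_prompt, mem, mem)  # query the prompt against memory
        return static_prompt + adapted                   # residual update

# Toy usage
mem = PromptMemory()
prompt = torch.randn(1, 8, 256)
mem.write(torch.randn(1, 256))
print(mem.adapt(prompt).shape)  # torch.Size([1, 8, 256])
```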
How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking
Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang
Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information, providing semantic guidance to enhance tracking performance under challenging conditions like fast motion and deformations. However, current VLT trackers often underperform compared to single-modality methods on multiple benchmarks, with semantic information sometimes becoming a "distraction." To address this, we propose VLTVerse, the first fine-grained evaluation framework for VLT trackers that comprehensively considers multiple challenge factors and diverse semantic information, hoping to reveal the role of language in VLT. Our contributions include: (1) VLTVerse introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information, creating a flexible and multi-dimensional evaluation space for VLT; (2) leveraging 60 subspaces formed by combinations of challenge factors and semantic types, we conduct systematic fine-grained evaluations of three mainstream SOTA VLT trackers, uncovering their performance bottlenecks across complex scenarios and offering a novel perspective on VLT evaluation; (3) through decoupled analysis of experimental results, we examine the impact of various semantic types on specific challenge factors in relation to different algorithms, providing essential guidance for enhancing VLT across data, evaluation, and algorithmic dimensions. The VLTVerse, toolkit, and results will be available at http://metaverse.aitestunion.com.
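To make the subspace-evaluation idea in the VLTVerse abstract concrete, here is a small hypothetical sketch that groups per-sequence success scores by (challenge label, semantic type) pairs and averages them; the records, field names, and scores are toy placeholders, not real evaluation data from the paper.

```python
# Illustrative sketch only: averaging success scores within
# (challenge label x semantic type) subspaces. All records below are toy data.
from collections import defaultdict
from statistics import mean

results = [
    # each record: sequence id, its challenge labels, the semantic type used, and a success score
    {"seq": "seq001", "challenges": ["fast_motion"], "semantic": "initial_description", "suc": 0.61},
    {"seq": "seq002", "challenges": ["deformation", "fast_motion"], "semantic": "dense_description", "suc": 0.48},
    {"seq": "seq003", "challenges": ["deformation"], "semantic": "initial_description", "suc": 0.55},
]

subspace_scores = defaultdict(list)
for r in results:
    for challenge in r["challenges"]:
        subspace_scores[(challenge, r["semantic"])].append(r["suc"])

for (challenge, semantic), scores in sorted(subspace_scores.items()):
    print(f"{challenge:>12} x {semantic:<20} SUC={mean(scores):.3f} (n={len(scores)})")
```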